Model architecture can transform catastrophic forgetting into positive transfer

The work of McCloskey and Cohen popularized the concept of catastrophic interference. They used a neural network that tried to learn addition using two groups of examples as two different tasks. In their case, learning the second task rapidly deteriorated the acquired knowledge about the previous one. We hypothesize that this could be a symptom of a fundamental problem: addition is an algorithmic task that should not be learned through pattern recognition. Therefore, other model architectures better suited for this task would avoid catastrophic forgetting. We use a neural network with a different architecture that can be trained to recover the correct algorithm for the addition of binary numbers. This neural network includes conditional clauses that are naturally treated within the back-propagation algorithm. We test it in the setting proposed by McCloskey and Cohen and by training on random additions one at a time. The neural network not only avoids catastrophic forgetting but also improves its predictive power on unseen pairs of numbers as training progresses. We also show that this effect is robust: it persists when averaging over many simulations. This work emphasizes the importance of neural network architecture for the emergence of catastrophic forgetting and introduces a neural network that is able to learn an algorithm.


The mathematical model
Addition of binary numbers. Computers are already programmed to perform the addition of binary numbers; we would like to define a NN that can change its internal weights to find an equivalent algorithm. We can recast the algorithm of addition of binary numbers in a way that resembles the typical structure of NNs. Figure 1 represents the algorithm for the addition of two binary numbers ($N_1$ and $N_2$). It outputs the correct result, $N_4$, and it also creates an additional array, $N_3$, that keeps track of the ones carried to the next step. Each of these binary numbers ($N_m$) is composed of binary digits ($\{1, 0\}$) that we denote $N_m^n$, where $n$ is an index indicating the position of the binary digit.
To add two binary numbers (see Fig. 1), the operator $U_{ijkl}$ acts sequentially on the input arrays ($N_1^n$, $N_2^n$ and $N_3^n$) and outputs two numbers at each step, $U_{N_3^n N_2^n N_1^n l}$, that correspond to $N_3^{n+1}$ and $N_4^n$, where $N_4$ is the result of the sum. In this way, $U_{ijkl}$ acts similarly to a convolutional filter, with two main differences:
• Instead of performing a weighted average of $N_1^n$, $N_2^n$ and $N_3^n$, it uses their values to choose one option (it is performing three conditional clauses).
• One component of the output ($U_{N_3^n N_2^n N_1^n 0}$) goes into the next line of the input ($N_3^{n+1}$). There is information from the previous step that goes into the next one.
See Fig. 1 for an example. We identify $N_3^n$, $N_2^n$ and $N_1^n$ as the $ijk$ indices in $U_{ijkl}$. In the first step, $N_1^0 = 1$, $N_2^0 = 1$ and $N_3^0 = 0$, and the output is $U_{011l} = (1, 0)$, where $(1, 0)$ corresponds to the next empty spaces of $N_3$ and $N_4$ ($N_3^1$ and $N_4^0$). Iteratively applying this operator we get the correct values for $N_3$ and $N_4$. If $N_1$ and $N_2$ have $Z$ digits, $N_3$ and $N_4$ will have $Z + 1$ digits, and in the last line we just identify the last digit of $N_4$ with the last digit of $N_3$ (the final carry). In the example shown in Fig. 1 this algorithm gets the correct answer for $29 + 43 = 72$.
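The exact operator described above can be sketched as a lookup table. The following is a minimal illustration, not the authors' code: the function names (`build_exact_operator`, `to_bits`, `from_bits`, `add_binary`) and the least-significant-digit-first storage are assumptions made for this example. The table maps the indices $(i, j, k) = (N_3^n, N_2^n, N_1^n)$ to the pair (carry, result digit), i.e. the components $l = 0$ and $l = 1$.

```python
def build_exact_operator():
    """Lookup table U[(i, j, k)] = (carry, digit), with (i, j, k) = (N3^n, N2^n, N1^n)."""
    U = {}
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                total = i + j + k
                U[(i, j, k)] = (total // 2, total % 2)  # l = 0: carry, l = 1: digit
    return U


def to_bits(value, width):
    """Binary digits of `value`, least significant digit first (assumed convention)."""
    return [(value >> n) & 1 for n in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << n for n, b in enumerate(bits))


def add_binary(x, y, width):
    """Apply U line by line; returns the carry array N3 and the result N4."""
    U = build_exact_operator()
    n1, n2 = to_bits(x, width), to_bits(y, width)
    n3 = [0]  # N3^0 = 0: nothing is carried into the first line
    n4 = []
    for n in range(width):
        carry, digit = U[(n3[n], n2[n], n1[n])]
        n3.append(carry)  # goes into the next line of the input
        n4.append(digit)
    n4.append(n3[width])  # the last digit of N4 is the final carry
    return n3, n4


n3, n4 = add_binary(29, 43, width=6)
result = from_bits(n4)
print(result)  # 72
```

The first line of this example reproduces the step described in the text: the indices $(0, 1, 1)$ select the output $(1, 0)$.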
Figure 1. Addition of two binary numbers. $U_{ijkl}$ can be understood as a convolutional operator (different from the classical convolutional filters). It is applied to the input ($N_1$, $N_2$) line by line. The $n$th line of the input has values $N_3^n$, $N_2^n$ and $N_1^n$, which correspond to the indices $ijk$ respectively. The chosen array $U_{N_3^n N_2^n N_1^n l}$ provides the outputs $N_3^{n+1}$ and $N_4^n$. $N_4$ corresponds to the result of $N_1 + N_2$, in this case $29 + 43 = 72$. $\hat{U}_{ijkl}$ corresponds to the operator used in our NN, with parameters $a_{ijkl}$ that are learned through training. $\hat{U}_{ijkl}$ acts on $N_3^n$, $N_2^n$ and $N_1^n$ analogously to $U_{ijkl}$, although in this case $N_3^n$ can be a real number different from 0 or 1; in that case the output is a combination of both options, weighted with $N_3^n$. See the main text for more details.
Building an algorithmic neural network. We can now define the architecture of our neural network.
We would like to have $U_{ijkl}$ as a specific (learned) state in its parameter space. For simplicity, we define our NN using a modified operator, $\hat{U}_{ijkl}$:
$$\hat{U}_{ijkl} = S(a_{ijkl}), \qquad (1)$$
where $S()$ stands for the sigmoid operator, and $a_{ijkl}$, with $i, j, k, l \in \{0, 1\}$, are 16 parameters that have to be learned. Our neural network takes the operator $\hat{U}_{ijkl}$ and applies it to the input numbers ($N_1$ and $N_2$) as shown in Fig. 1 for $U_{ijkl}$. In this process the network creates two new arrays corresponding to $N_3$ and $N_4$. The predicted answer to the input addition is $N_4$, so let us use the notation $N_4 = F(N_1, N_2, a_{ijkl})$ to highlight that our neural network is a nonlinear function of the inputs $N_1$ and $N_2$ and of the parameters $a_{ijkl}$. There is a fundamental difference with the previous case: now the digits of $N_3$ and $N_4$ will be real numbers between 0 and 1, and we have to decide how to apply $\hat{U}_{ijkl}$ when $N_3^n$ is different from $\{0, 1\}$. If we apply the operator $\hat{U}_{ijkl}$ to the same example shown in Fig. 1, the first line is again $N_1^0 = 1$, $N_2^0 = 1$ and $N_3^0 = 0$, and the output will now be $N_3^1 = S(a_{0110})$ and $N_4^0 = S(a_{0111})$.
In the second line we now find $N_1^1 = 0$, $N_2^1 = 1$ and $N_3^1 = S(a_{0110})$. Since $S(a_{0110})$ is a real number between 0 and 1, we compute the output of this line as a combination of $\hat{U}_{010l}$ and $\hat{U}_{110l}$ in the following way:
$$N_3^2 = (1 - S(a_{0110})) \cdot S(a_{0100}) + S(a_{0110}) \cdot S(a_{1100}),$$
$$N_4^1 = (1 - S(a_{0110})) \cdot S(a_{0101}) + S(a_{0110}) \cdot S(a_{1101}),$$
where we use the dot product, $\cdot$, for clarity. In this way, if $N_3^1 = S(a_{0110})$ is exactly equal to 0 or 1 we recover the theoretical algorithm shown in Fig. 1, and when $S(a_{0110}) \in (0, 1)$ the output is a combination of both options weighted by $S(a_{0110})$. Applying this operator iteratively, the output of our neural network, $F(N_1, N_2, a_{ijkl})$, is computed. To define a learning process we create a loss function. In the simplest case, we would like to learn the addition of two specific numbers ($N_1 + N_2$), and we can define the loss function as:
$$L = \sum_n \left( \tilde{N}_4^n - F(N_1, N_2, a_{ijkl})^n \right)^2, \qquad (5)$$
where $\tilde{N}_4$ is the correct result of the sum $N_1 + N_2$, and $\tilde{N}_4^n$ and $F(N_1, N_2, a_{ijkl})^n$ are the $n$th binary digits of $\tilde{N}_4$ and $F(N_1, N_2, a_{ijkl})$, respectively. To compute $F(N_1, N_2, a_{ijkl})$ our model uses sums, multiplications and a nonlinear function, $S()$, resembling the standard convolutional filters extensively used in deep neural networks. Finally, we can differentiate $L$ with respect to the parameters of the neural network ($a_{ijkl}$) using standard methods [36]. Training the NN consists of changing the value of the parameters ($a_{ijkl}$) following $-\partial L / \partial a_{ijkl}$, such that the loss function is minimized.
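As an illustration of the forward pass and the loss described above, here is a minimal sketch in plain Python. It is an assumption-laden example rather than the authors' implementation: the names `forward`, `loss` and `exact_parameters`, the dictionary layout of the parameters, and the least-significant-digit-first convention are all hypothetical choices. The loss is taken as the plain sum of squared digit errors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def to_bits(value, width):
    """Binary digits, least significant digit first (assumed convention)."""
    return [(value >> n) & 1 for n in range(width)]


def forward(x, y, a, width):
    """Soft adder F(N1, N2, a): returns the width + 1 real-valued digits of N4.

    `a` maps (i, j, k, l) to a real parameter; l = 0 is the carry, l = 1 the digit.
    """
    n1, n2 = to_bits(x, width), to_bits(y, width)
    carry = 0.0  # N3^0 = 0
    n4 = []
    for n in range(width):
        j, k = n2[n], n1[n]
        # Blend both branches of the "if N3" clause, weighted by the current carry.
        digit = (1 - carry) * sigmoid(a[(0, j, k, 1)]) + carry * sigmoid(a[(1, j, k, 1)])
        new_carry = (1 - carry) * sigmoid(a[(0, j, k, 0)]) + carry * sigmoid(a[(1, j, k, 0)])
        n4.append(digit)
        carry = new_carry
    n4.append(carry)  # the last digit of N4 is the final carry
    return n4


def loss(x, y, a, width):
    """Sum of squared differences between the correct digits and the prediction."""
    target = to_bits(x + y, width + 1)
    return sum((t - p) ** 2 for t, p in zip(target, forward(x, y, a, width)))


def exact_parameters(scale=20.0):
    """Parameters whose sigmoids approximate the exact operator U (large |a|)."""
    a = {}
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                total = i + j + k
                a[(i, j, k, 0)] = scale if total // 2 else -scale
                a[(i, j, k, 1)] = scale if total % 2 else -scale
    return a


a_exact = exact_parameters()
print(loss(29, 43, a_exact, width=6))  # very close to zero
```

With saturated parameters the soft operator reduces to the exact algorithm, which is the limit discussed in the text; with all parameters at zero every sigmoid equals 0.5 and every predicted digit is 0.5.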

Learning the ones and twos addition facts: the problem proposed by McCloskey and Cohen.
In the problem proposed by McCloskey and Cohen, a neural network is trained using two tasks: the "ones addition facts" (all the additions of 1 with another digit, 1 + 1 = 2 through 9 + 1 = 10 and 1 + 1 = 2 through 1 + 9 = 10 ) and the "twos addition facts" ( 1 + 2 = 3 through 9 + 2 = 11 and 2 + 1 = 3 through 2 + 9 = 11 ). In their work, the neural network catastrophically forgets the first task when training on the second one. We aim to show that this was due to an inadequate choice of model architecture and that a different architecture, such as the one proposed in this work, will not display catastrophic forgetting when training on these tasks.
We create two datasets, one for the ones and another for the twos addition facts. We train the model using the ones addition facts first, and then continue training the same model using the twos addition facts. We create a loss function for each of these datasets: for each pair of numbers $N_1$ and $N_2$ we compute the correct result $N_1 + N_2 = \tilde{N}_4$ and the corresponding loss, using Eq. (5). The total loss for each task is the average value of Eq. (5) evaluated over all pairs of numbers in the task (e.g. 1 + 2 = 3 through 9 + 2 = 11 and 2 + 1 = 3 through 2 + 9 = 11 for the twos addition facts). We train the network using gradient descent with a learning rate equal to 1. We train for 2000 steps using the "ones addition facts" loss and for another 2000 steps using the "twos addition facts" loss. In Fig. 2 we plot both losses during the learning process to quantify the performance of the model on both tasks. Figure 2a shows that the loss functions corresponding to both tasks greatly decrease when training on the "ones addition facts": learning the "ones addition facts" has a positive transfer to the "twos addition facts". Similarly, both loss functions keep decreasing when training on the "twos addition facts": there is no catastrophic forgetting and the model shows positive backward transfer from the new task to the previous one. Figure 2b shows the evolution of the parameters of the network. At initialization, the parameters of the network ($a_{ijkl}$) are random real numbers between $-1$ and $1$, which leads to $S(a_{ijkl})$ being randomly distributed between $1/(1 + e) \sim 0.27$ and $e/(1 + e) \sim 0.73$. To recover the correct addition algorithm, $U_{ijkl}$, the black continuous lines should saturate to one whereas the red dashed lines should go to zero as learning progresses.
Up to step $\sim 90$ in the minimization process, we observe some lines displaying non-monotonic behavior until all of them (except for two black lines) tend to the correct values; we term this region the non-trivial learning regime. For minimization steps larger than $\sim 100$ (including when training switches to the "twos addition facts") learning continues smoothly and all the parameters (except two) approach their asymptotic values; we term this the trivial learning regime. The two black lines that do not approach 1 correspond to $S(a_{111l})$. The reason for this behavior is that the tasks used here do not contain any addition that would require these parameters, which are only used when $N_1^n = N_2^n = N_3^n = 1$ (for a specific $n$), for example in 3 + 3.
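The two-task protocol of this section can be sketched as follows. This is a hypothetical reimplementation under stated assumptions, not the authors' code: the gradient is written out by hand through the carry chain (any back-propagation tool would serve), operands are stored with 4 binary digits, each task keeps all 18 listed facts (so 1 + 1 and 2 + 2 appear twice), and the seed and helper names (`loss_and_grad`, `mean_loss`, `train`) are illustrative.

```python
import math
import random

WIDTH = 4  # enough binary digits for operands up to 9
KEYS = [(i, j, k, l) for i in (0, 1) for j in (0, 1) for k in (0, 1) for l in (0, 1)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bits(value, width):
    return [(value >> n) & 1 for n in range(width)]


def loss_and_grad(x, y, a):
    """Squared-error loss for one addition and its gradient w.r.t. the 16 parameters."""
    s = {key: sigmoid(a[key]) for key in KEYS}
    n1, n2, target = bits(x, WIDTH), bits(y, WIDTH), bits(x + y, WIDTH + 1)
    carries, digits = [0.0], []
    for n in range(WIDTH):  # forward pass
        c, j, k = carries[n], n2[n], n1[n]
        digits.append((1 - c) * s[(0, j, k, 1)] + c * s[(1, j, k, 1)])
        carries.append((1 - c) * s[(0, j, k, 0)] + c * s[(1, j, k, 0)])
    digits.append(carries[WIDTH])
    loss = sum((t - d) ** 2 for t, d in zip(target, digits))
    grad_s = {key: 0.0 for key in KEYS}
    g_carry = -2.0 * (target[WIDTH] - carries[WIDTH])  # dL/d(final carry)
    for n in reversed(range(WIDTH)):  # backward pass through the carry chain
        c, j, k = carries[n], n2[n], n1[n]
        g_digit = -2.0 * (target[n] - digits[n])
        grad_s[(0, j, k, 1)] += g_digit * (1 - c)
        grad_s[(1, j, k, 1)] += g_digit * c
        grad_s[(0, j, k, 0)] += g_carry * (1 - c)
        grad_s[(1, j, k, 0)] += g_carry * c
        g_carry = (g_digit * (s[(1, j, k, 1)] - s[(0, j, k, 1)])
                   + g_carry * (s[(1, j, k, 0)] - s[(0, j, k, 0)]))
    grad_a = {key: grad_s[key] * s[key] * (1 - s[key]) for key in KEYS}
    return loss, grad_a


def mean_loss(pairs, a):
    return sum(loss_and_grad(x, y, a)[0] for x, y in pairs) / len(pairs)


def train(pairs, a, steps, lr=1.0):
    """Plain gradient descent on the task loss (average over the pairs)."""
    for _ in range(steps):
        total = {key: 0.0 for key in KEYS}
        for x, y in pairs:
            _, grad = loss_and_grad(x, y, a)
            for key in KEYS:
                total[key] += grad[key] / len(pairs)
        for key in KEYS:
            a[key] -= lr * total[key]


ones = [(d, 1) for d in range(1, 10)] + [(1, d) for d in range(1, 10)]
twos = [(d, 2) for d in range(1, 10)] + [(2, d) for d in range(1, 10)]

rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch only
a = {key: rng.uniform(-1.0, 1.0) for key in KEYS}
initial = (mean_loss(ones, a), mean_loss(twos, a))
train(ones, a, steps=2000)
after_ones = (mean_loss(ones, a), mean_loss(twos, a))
train(twos, a, steps=2000)
after_twos = (mean_loss(ones, a), mean_loss(twos, a))
print(initial, after_ones, after_twos)  # (ones loss, twos loss) at each stage
```

Comparing the three printed pairs gives a rough numerical counterpart of Fig. 2a for one random initialization; the exact values depend on the seed and should not be read as the published results.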
Learning to sum, one example at a time. We now perform a second experiment where we train on one sample at a time. Each sample consists of two integers chosen at random ($N_1 + N_2 = \tilde{N}_4$). This is more challenging than the example of the previous section. There, the system was trained on a group of additions, so the loss function imposed more constraints on the parameters, and training led to the correct values of $a_{ijkl}$. Now, the model is trained using one sample at a time, which gives more freedom to the parameters. We train on each sample for 50 steps using gradient descent on Eq. (5). Figure 3 shows this training process. The training loss (panel (a) of Fig. 3) decreases for 50 steps every time a new example is shown; this indicates that the operator $\hat{U}_{ijkl}$ is changing (learning) to correctly add this particular pair of numbers. When we switch to the next sample there is a sudden increase of the loss, followed by a decrease for another 50 steps. However, the test loss (contrary to the previous section) increases when training on the first samples, indicating the presence of overfitting: the NN is learning the addition of a pair of numbers, but the rules it is learning do not generalize to the rest of the samples. From step $\sim 400$ onward, the test loss shows a steady decrease. Panel (b) of Fig. 3 shows the value of the parameters $S(a_{ijkl})$ at all times. Similarly to the previous section, there is a non-trivial regime where the parameters show non-monotonic behavior. In this case, they show more sudden transitions due to the change in training data every 50 steps. After step $\sim 400$ all parameters approach their correct values, leading to a trivial learning regime. One could wonder how reproducible the results of Fig. 3 are, or whether the learning process could get stuck in the non-trivial regime, preventing the system from reaching the trivial regime and correctly learning addition.
To study the robustness of this behavior we now perform 70 different simulations under the same conditions as in Fig. 3 (training on one example at a time for 50 minimization steps). For each simulation we use a new random initialization and different training samples picked at random; the test dataset is the same for all the simulations. We plot the average values of the training and test loss functions in Fig. 4. The inset shows the same data on a semi-log scale, including shaded areas around the mean values that correspond to one standard deviation of the data. Training on each sample leads to a decreasing training loss and an increasing test loss (overfitting). The average loss in Fig. 4 shows a power law behavior with an exponent $\sim -1$. Fig. 3 is a good representative of the behavior of most simulations contained in Fig. 4; most of them reach the trivial regime around step $\sim 1000$. However, some outliers take much longer to find the trivial learning regime, leading to the plateau observed in the test loss between steps $\sim 1000$ and $\sim 5000$, when the last outlier reaches the trivial learning regime. These results prove that this model architecture shows positive transfer: changing "tasks" leads to better performance in previous and future tasks instead of catastrophic interference. When we stop training, our NN has learned the rules of addition: it is able to sum any pair of numbers, whether they were included in the training data or not.
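The one-sample-at-a-time protocol can be sketched in a few lines. This is an illustrative example under explicit assumptions, not the authors' code: gradients are estimated with central finite differences as a stand-in for back-propagation, operands are drawn uniformly from 1 to 31, and the test-set size (50 pairs), the seed and the function names (`forward_loss`, `numerical_grad`, `train_one_at_a_time`) are hypothetical.

```python
import math
import random

WIDTH = 5  # operands have five binary digits
KEYS = [(i, j, k, l) for i in (0, 1) for j in (0, 1) for k in (0, 1) for l in (0, 1)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward_loss(x, y, a):
    """Squared-error loss of the soft adder on the single addition x + y."""
    carry, loss = 0.0, 0.0
    for n in range(WIDTH):
        j, k = (y >> n) & 1, (x >> n) & 1
        target = ((x + y) >> n) & 1
        # The digit uses the carry of the current line; the carry is updated afterwards.
        digit = (1 - carry) * sigmoid(a[(0, j, k, 1)]) + carry * sigmoid(a[(1, j, k, 1)])
        carry = (1 - carry) * sigmoid(a[(0, j, k, 0)]) + carry * sigmoid(a[(1, j, k, 0)])
        loss += (target - digit) ** 2
    loss += ((((x + y) >> WIDTH) & 1) - carry) ** 2  # last digit is the final carry
    return loss


def numerical_grad(x, y, a, eps=1e-5):
    """Central finite differences over the 16 parameters (stand-in for backprop)."""
    grad = {}
    for key in KEYS:
        original = a[key]
        a[key] = original + eps
        plus = forward_loss(x, y, a)
        a[key] = original - eps
        minus = forward_loss(x, y, a)
        a[key] = original
        grad[key] = (plus - minus) / (2 * eps)
    return grad


def train_one_at_a_time(num_samples, steps_per_sample=50, lr=1.0, seed=0):
    rng = random.Random(seed)
    a = {key: rng.uniform(-1.0, 1.0) for key in KEYS}
    test_set = [(rng.randint(1, 31), rng.randint(1, 31)) for _ in range(50)]
    history = []  # (train loss before, train loss after, test loss after) per sample
    for _ in range(num_samples):
        x, y = rng.randint(1, 31), rng.randint(1, 31)
        before = forward_loss(x, y, a)
        for _step in range(steps_per_sample):
            grad = numerical_grad(x, y, a)
            for key in KEYS:
                a[key] -= lr * grad[key]
        after = forward_loss(x, y, a)
        test = sum(forward_loss(u, v, a) for u, v in test_set) / len(test_set)
        history.append((before, after, test))
    return a, history


a, history = train_one_at_a_time(num_samples=40)
for before, after, test in history[::10]:
    print(round(before, 4), round(after, 4), round(test, 4))
```

Repeating the call with different `seed` values and averaging the recorded test losses would mimic the ensemble of simulations described above; how quickly a given run leaves the non-trivial regime varies from seed to seed.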
Analysis of the results. In the cases shown in Figs. 3 and 4, we use five-digit binary numbers for $N_1$ and $N_2$. The largest number considered in the additions is 11111, 31 in decimal form. Therefore, there are $(31 \cdot 31) = 961$ possible combinations to create our training and test datasets of additions. As we have seen in Figs. 3 and 4, it was enough to take around 500 minimization steps ($\sim 10$ different samples) to correctly learn addition and enter the trivial learning regime. After this, all the parameters start to asymptotically approach their theoretical values and the loss displays a power law decay with an exponent $\sim -1$. Figure 2 also shows how learning in our model is divided into non-trivial and trivial learning regimes. In the former regime, the evolution of the parameters ($a_{ijkl}$) is coupled, and they display non-monotonic behavior. This regime is characterized by a non-trivial exponent 0.2, for which we do not have an analytical derivation. In the latter regime, the coefficients asymptotically approach their theoretical values, $a_{ijkl} \to \pm\infty$, such that $S(a_{ijkl}) \to 0, 1$. The loss function that the system minimizes is a sum of terms that compare each binary digit of the correct result with the number predicted by the network, $\left( \tilde{N}_4^n - F(N_1, N_2, a_{ijkl})^n \right)^2$. In this regime, the dynamics of the parameters of the network are uncoupled and all behave in an equivalent manner. For simplicity, let us study the evolution of one of the parameters, which we term $a$. In this regime there are two possibilities, $a \to \pm\infty$, for which
$$S(a) \simeq 1 - e^{-a} \;\; (a \to +\infty), \qquad S(a) \simeq e^{a} \;\; (a \to -\infty).$$
Keeping up to leading order in $e^{-|a|}$, the terms appearing in the loss function take the following forms:
$$\left(1 - S(a)\right)^2 \simeq e^{-2a} \;\; (a \to +\infty), \qquad \left(0 - S(a)\right)^2 \simeq e^{2a} \;\; (a \to -\infty). \qquad (7)$$
All the terms included in the loss function are then proportional to the ones in equation (7) and we can study the evolution of one of them. Let us take the case $e^{2a}$, $a \to -\infty$. Our dynamics are discrete, but we assume that the continuous limit is a good approximation in the trivial learning regime.
If we term $t$ the minimization time, the parameter $a$ evolves as
$$\frac{da}{dt} = -\frac{\partial}{\partial a} e^{2a} = -2 e^{2a},$$
and we can integrate this equation as
$$e^{-2a} = 4t + C, \qquad (9)$$
where $C$ is an additive constant that carries the information about the initial condition of $a$. This constant can be neglected in the limit $t \to \infty$. Solving (9) for $a$ we find
$$a \simeq -\frac{1}{2} \ln(4t),$$
indicating that, in the trivial learning regime, the parameters $a_{ijkl}$ tend to $\pm\infty$ in a logarithmic manner. Finally, since the loss is a sum of terms proportional to the ones in equation (7), using again the case $a \to -\infty$, we get
$$L(t) \sim e^{2a} \sim \frac{1}{4t}. \qquad (11)$$
Equation (11) recovers the scaling observed numerically in the trivial regime of Figs. 2 and 4, $L(t) \sim t^{-1}$.
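The $t^{-1}$ scaling can be checked numerically on the single-term model used in this derivation. The sketch below is only an illustration under stated assumptions: it takes the loss to be exactly $e^{2a}$ (unit prefactor), uses discrete gradient descent with learning rate 1, and starts from the arbitrary value $a = -1$; the function name `simulate` is hypothetical. For this toy model the continuous limit gives $L \approx 1/(4t)$, so $t \cdot L$ should approach $1/4$ and the log-log slope should approach $-1$.

```python
import math


def simulate(a0=-1.0, steps=10000, lr=1.0):
    """Discrete gradient descent on the single loss term L(a) = exp(2a)."""
    a = a0
    losses = []
    for _ in range(steps):
        a -= lr * 2.0 * math.exp(2.0 * a)  # dL/da = 2 exp(2a)
        losses.append(math.exp(2.0 * a))   # loss after this step
    return losses


losses = simulate()
t1, t2 = 1000, 10000
# Slope of log L versus log t between two late times.
slope = math.log(losses[t2 - 1] / losses[t1 - 1]) / math.log(t2 / t1)
print(slope)                # close to -1
print(t2 * losses[t2 - 1])  # close to 1/4
```

The small deviations from $-1$ and $1/4$ come from the additive constant and from the discreteness of the updates, both of which become negligible at large $t$.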

Discussion
The NN defined in this work is able to learn to sum any two numbers when trained on a finite set of examples. But should we still consider the addition of different pairs of numbers as different tasks? It probably depends on the NN that is being used. If the NN is only able to perform some version of pattern recognition, it will not be able to extract any common rules, and training on different samples will lead to catastrophic forgetting. However, if the NN has the necessary set of tools (as shown in this work), the additions of different pairs of numbers constitute different examples of the same task, and training on all of them has a positive effect. This is similar to humans, who can transfer previous knowledge when they have a deep, rather than superficial, understanding [12].
We have defined a NN that is algorithmically aligned [37,38] with the correct algorithm for the sum of binary numbers. This NN was able to learn the correct parameters through gradient descent when different examples of additions were used as training data. To define this NN we have created a layer that operates similarly to a traditional convolutional layer but with three important differences:
• It performs different operations based on the input (equivalent to "if" clauses), instead of performing a weighted average of the input values (filter).
• At every step it passes information to the next line of the input ($N_3^{n+1}$).
• If the input ($N_3^n$) is not 0 or 1, the network combines both options of the corresponding "if" clause, weighting them with $N_3^n$.
The mathematical operations required to apply our model to the input data, and to build the loss function, are just sums, multiplications and the sigmoid nonlinear function, similar to standard convolutional filters. This allows us to use standard back-propagation algorithms. Since our model has only 16 parameters, we have not performed a systematic study of the runtimes. Such a study could be necessary if this model were to be combined with standard deep learning layers.
In future work we would like to increase the complexity of this algorithmic neural network, which is able to perform different tasks depending on the input. It should be possible to combine this NN with other types of neural networks to create a complete model with the capacity to learn an algorithm while taking advantage of the power of other NNs (e.g. convolutional NNs). Hopefully, this could help "building causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems" [39]. Training these complete models can lead to a non-convex optimization problem that can benefit from changing the topography of the loss function landscape [40]; an effective way of doing this in machine learning is the use of dynamical loss functions [41].
In the work of McCloskey and Cohen the additions of two groups of numbers were considered as two different tasks. When training on the second task the previous knowledge was erased. However, this could have been just a symptom of using a pattern-recognition neural network for seemingly different tasks. Addition is the paradigm of algorithmic tasks, which should not be learned through the memorization of a large number of examples but through finding/learning the correct algorithm. We hope these results will inspire practitioners to explore new model architectures when encountering catastrophic forgetting within standard frameworks.